DPCube: Differentially Private Histogram
Release through Multidimensional Partitioning 

Yonghui Xiao, Li Xiong, Liyue Fan and Slawomir Goryczka 

Abstract — Differential privacy is a strong notion for protecting individual privacy in privacy preserving data analysis or 
publishing. In this paper, we study the problem of differentially private histogram release for random workloads. We study two 
multidimensional partitioning strategies including: 1) a baseline cell-based partitioning strategy for releasing an equi-width cell 
histogram, and 2) an innovative 2-phase kd-tree based partitioning strategy for releasing a v-optimal histogram. We formally 
analyze the utility of the released histograms and quantify the errors for answering linear queries such as counting queries. 
We formally characterize the property of the input data that will guarantee the optimality of the algorithm. Finally we implement 
and experimentally evaluate several applications using the released histograms, including counting queries, classification, and 
blocking for record linkage and show the benefit of our approach. 

Index Terms — Differential privacy, non-interactive data release, histogram, classification, record linkage. 
As information technology enables the collection, storage,
and usage of massive amounts of information
about individuals and organizations, privacy becomes 
an increasingly important issue. Governments and 
organizations recognize the critical value in sharing 
such information while preserving the privacy of 
individuals. Privacy preserving data analysis and data 
publishing fS], [jH, ||5l has received considerable atten- 
tion in recent years. There are two models for privacy 
protection |3|: the interactive model and the non- 
interactive model. In the interactive model, a trusted 
curator (e.g. hospital) collects data from record owners 
(e.g. patients) and provides an access mechanism for 
data users (e.g. public health researchers) for querying 
or analysis purposes. The result returned from the 
access mechanism is perturbed by the mechanism 
to protect privacy. In the non-interactive model, the 
curator publishes a "sanitized" version of the data, 
simultaneously providing utility for data users and 
privacy protection for the individuals represented in 
the data. 

Differential privacy El, 0, IS, 13, H is widely 
accepted as one of the strongest known privacy guar- 
antees with the advantage that it makes few as- 
sumptions on the attacker's background knowledge. 
It requires the outcome of computations to be formally
indistinguishable when run with or without
any particular record in the dataset, as if it makes
little difference whether an individual is being opted
in or out of the database. Many meaningful results
have been obtained for the interactive model with
differential privacy [6], [7], [3], [5]. Non-interactive
data release with differential privacy has recently been
studied, with hardness results obtained, and it remains
an open problem to find efficient algorithms for many
domains [9], [10].

Fig. 1. Differentially private histogram release

In this paper, we study the problem of differentially
private histogram release based on a differential
privacy interface, as shown in Figure 1. A histogram
is a disjoint partitioning of the database points with
the number of points which fall into each partition.
A differential privacy interface, such as the Privacy
INtegrated Queries platform (PINQ) [11], provides
differentially private access to the raw database.
An algorithm implementing the partitioning strategy 
submits a sequence of queries to the interface and 
generates a differentially private histogram of the raw 
database. The histogram can then serve as a sanitized 
synopsis of the raw database and, together with an 
optional synthesized dataset based on the histogram, 
can be used to support count queries and other types 
of OLAP queries and learning tasks. 

An immediate question is what advantage the
non-interactive release has over using the interactive
mechanism to answer the queries directly. A common
mechanism providing differentially private answers is to
add carefully calibrated noise to each query answer,
determined by the privacy parameter and the sensitivity
of the query. The composability of differential privacy [11]
ensures privacy guarantees for a sequence of differentially
private computations, with additive privacy depletion in the
worst case. Given an overall privacy requirement or
budget, expressed as a privacy parameter, it can be
allocated to subroutines or to each query in the query
sequence to ensure the overall privacy. When the
number of queries grows, each query gets a lower
privacy budget, which requires larger noise to be added.
When there are multiple users, they have to share a
common privacy budget, which degrades the utility
rapidly. The non-interactive approach essentially
exploits the data distribution and the query workload
and uses a carefully designed algorithm or query
strategy such that the overall noise is minimized for a
particular class of queries. As a result, the partitioning
strategy and the algorithm implementing the strategy
for generating the query sequence to the interface
are crucial to the utility of the resulting histogram or
synthetic dataset.

Contributions. We study differentially private his- 
togram release for random query workload in this 
paper and propose partitioning and estimation algo- 
rithms with formal utility analysis and experimental 
evaluations. We summarize our contributions below. 

• We study two multidimensional partitioning 
strategies for differentially private histogram re- 
lease: 1) a baseline cell-based partitioning strategy 
for releasing an equi-width cell histogram, and 
2) an innovative 2-phase kd-tree (k-dimensional 
tree) based space partitioning strategy for releas- 
ing a v-optimal histogram. There are several inno- 
vative features in our 2-phase strategy. First, we 
incorporate a uniformity measure in the partition- 
ing process which seeks to produce partitions that 
are close to uniform so that approximation er- 
rors within partitions are minimized, essentially 
resulting in a differentially private v-optimal his- 
togram. Second, we implement the strategy using 
a two-phase algorithm that generates the kd-tree 
partitions based on the cell histogram so that 
the access to the differentially private interface 
is minimized. 

• We formally analyze the utility of the released
histograms and quantify the errors for answering
linear distributive queries such as counting
queries. We show that the cell histogram provides
bounded query error for any input data. We
also show that the v-optimal histogram combined
with a simple query estimation scheme achieves
bounded query error and better utility than
existing approaches for "smoothly" distributed
data. We formally characterize the "smoothness"
property of the input data that guarantees the
optimality of the algorithm.
• We implement and experimentally evaluate sev- 
eral applications using the released histograms, 
including counting queries, classification, and 
blocking for record linkage. We compare our 
approach with other existing privacy-preserving 
algorithms and show the benefit of our approach. 

2 Related works 

Privacy preserving data analysis and publishing has 
received considerable attention in recent years. We 
refer readers to ||3l, lH, for several up-to-date 
surveys. We briefly review here the most relevant 
work to our paper and discuss how our work differs 
from existing work. 

There has been a series of studies on interactive 
privacy preserving data analysis based on the notion 
of differential privacy l|6l, ||7|, JSl, II3- A primary 
approach proposed for achieving differential privacy
is to add Laplace noise [6], [3], [7] to the original
results. McSherry and Talwar [12] give an alternative
method to implement differential privacy based
on the probability of a returned result, called the
exponential mechanism. Roth and Roughgarden flSl
propose a median mechanism which improves upon
the Laplace mechanism. McSherry implemented the
interactive data access mechanism in PINQ [11], a
platform providing a programming interface through
a SQL-like language.

A few works have started addressing non-interactive
data release that achieves differential privacy. Blum
et al. [9] proved the possibility of non-interactive data
release satisfying differential privacy for queries with
polynomial VC-dimension, such as predicate queries.
They also proposed an algorithm based on the
exponential mechanism. The result largely remains
theoretical, and the general algorithm is inefficient in
both complexity and required data size. [10] further
proposed more efficient algorithms with hardness
results obtained, and it remains a key open problem
to find efficient algorithms for non-interactive data
release with differential privacy for many domains. [14]
pointed out that a natural approach to side-stepping
the hardness is relaxing the utility requirement, and
not requiring accuracy for every input database.

Several recent works studied differentially private
mechanisms for particular kinds of data such as
search logs [15], [16] or set-valued data [17]. Others
proposed algorithms for specific applications or
optimization goals such as recommender systems [18],
record linkage [19], data mining [20], or differentially
private data cubes with minimized overall cuboid
error [21]. It is important to note that [19] uses several
tree strategies including kd-tree in its partitioning step,
and our results show that our 2-phase uniformity-driven
kd-tree strategy achieves better utility for random
count queries.

A few works considered releasing data for predictive
count queries and are closely related to ours.
X. Xiao et al. [22] developed an algorithm using
wavelet transforms. [23] generates differentially private
histograms for single dimensional range queries
through a hierarchical partitioning approach and a
consistency check technique. [24] proposes a query
matrix mechanism that generates an optimal query
strategy based on the query workload of linear count
queries and further maps the work in [22] and [23]
to special query strategies that can be represented
by a query matrix. It is worth noting that the cell-based
partitioning in our approach is essentially the
identity query matrix referred to in [24]. While we will
leverage the query matrix framework to formally
analyze our approach, it is important to note that the
above-mentioned query strategies are data-oblivious
in that they are determined by the query workload, a
static wavelet matrix, or a hierarchical matrix without
taking into consideration the underlying data. On the
other hand, our 2-phase kd-tree based partitioning
is designed to explicitly exploit the smoothness of
the underlying data indirectly observed through the
differentially private interface, and the final query matrix
corresponding to the released histogram is dependent
on the approximate data distribution. We are also
aware of a forthcoming work [25] and will compare
with it as the proceedings become available.

In summary, our work complements and advances 
the above works in that we focus on differentially 
private histogram release for random query workload 
using a multidimensional partitioning approach that 
is "data-aware". Sharing the insights from [14], [17], 
our primary viewpoint is that it is possible and desir- 
able in practice to design adaptive or data-dependent 
heuristic mechanisms for differentially private data 
release for useful families or subclasses of databases 
and applications. Our approach provides formal util- 
ity guarantees for a class of queries and also supports 
a variety of applications including general OLAP, 
classification and record linkage. 



3 Preliminaries and Definitions 

In this section, we formally introduce the definitions
of differential privacy, the data model and the queries
we consider, as well as a formal utility notion called
$(\epsilon, \delta)$-usefulness. Matrices and vectors are indicated
with bold letters (e.g. H, x) and their elements are
indicated as $H_{ij}$ or $x_i$. While we introduce
mathematical notations in this section and subsequent
sections, Table 1 lists the frequently used symbols for
reference.

TABLE 1
Frequently used symbols

Symbol                          Description
n                               number of records in the dataset
m                               number of cells in the data cube
$x_i$                           original count of cell i ($1 \le i \le m$)
$y_i$                           released count of cell i in cell histogram
$y_p$                           released count of partition p in subcube histogram
$n_p$                           size of partition p
s                               size of query range
$\alpha, \alpha_1, \alpha_2$    differential privacy parameters
$\gamma$                        smoothness parameter


3.1 Differential Privacy 

Definition 3.1 ($\alpha$-Differential privacy iO): In the interactive
model, an access mechanism $A$ satisfies $\alpha$-differential
privacy if for any neighboring databases^1
$D_1$ and $D_2$, any query function $Q$, and any $r \subseteq Range(Q)$,
where $A_Q(D)$ denotes the answer the mechanism returns to query
$Q(D)$,

$$\Pr[A_Q(D_1) = r] \le e^{\alpha} \Pr[A_Q(D_2) = r]$$

In the non-interactive model, a data release mechanism
$A$ satisfies $\alpha$-differential privacy if for all neighboring
databases $D_1$ and $D_2$, and released output $\hat{D}$,

$$\Pr[A(D_1) = \hat{D}] \le e^{\alpha} \Pr[A(D_2) = \hat{D}]$$

Laplace Mechanism. To achieve differential privacy,
we use the Laplace mechanism [6], which adds random
noise drawn from a Laplace distribution to the true answer of a
query $Q$: $A_Q(D) = Q(D) + N$, where $N$ is the Laplace
noise. The magnitude of the noise depends on the
privacy level and the query's sensitivity.

Definition 3.2 (Sensitivity): For arbitrary neighboring
databases $D_1$ and $D_2$, the sensitivity of a query $Q$,
denoted by $S_Q$, is the maximum difference between
the query results on $D_1$ and $D_2$,

$$S_Q = \max |Q(D_1) - Q(D_2)| \qquad (1)$$

To achieve $\alpha$-differential privacy for a given query
$Q$ on dataset $D$, it is sufficient to return $Q(D) + N$ in
place of the original result $Q(D)$, where we draw $N$
from $Lap(S_Q/\alpha)$ [6].
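
The Laplace mechanism above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the PINQ implementation; the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampler relies on the fact that the difference of two i.i.d. exponential variables with mean b follows Lap(b).

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, alpha, rng):
    """Return Q(D) + N with N ~ Lap(S_Q / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return true_answer + laplace_noise(sensitivity / alpha, rng)

# Illustrative call: a counting query (sensitivity 1) whose true answer is 10.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_answer=10, sensitivity=1.0, alpha=0.5, rng=rng)
print(noisy_count)
```

A smaller `alpha` (stronger privacy) gives a larger noise scale, which is the trade-off the budget discussion in the rest of the paper builds on.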

Composition. The composability of differential pri- 
vacy [11] ensures privacy guarantees for a sequence 
of differentially-private computations. For a general 
series of analysis, the privacy parameter values add 
up, i.e. the privacy guarantees degrade as we expose 
more information. In a special case that the analyses 
operate on disjoint subsets of the data, the ultimate 
privacy guarantee depends only on the worst of the 
guarantees of each analysis, not the sum. 

1. We use the definition of unbounded neighboring databases 
\8i consistent with |11| which treats the databases as multisets of 
records and requires their symmetric difference to be 1. 
Theorem 3.1 (Sequential Composition [11]): Let $M_i$
each provide $\alpha_i$-differential privacy. The sequence of
$M_i$ provides $(\sum_i \alpha_i)$-differential privacy.

Theorem 3.2 (Parallel Composition [11]): If $D_i$ are disjoint
subsets of the original database and $M_i$ provides
$\alpha$-differential privacy for each $D_i$, then the sequence
of $M_i$ provides $\alpha$-differential privacy.
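
As a small illustration of how the two composition theorems translate into budget accounting, the sketch below computes the total privacy cost in each case. The helper names are hypothetical, and the parallel case follows the prose above in taking the worst individual guarantee.

```python
def sequential_cost(alphas):
    """Mechanisms run on the same data: the privacy parameters add up."""
    return sum(alphas)

def parallel_cost(alphas):
    """Mechanisms run on disjoint subsets: only the worst parameter counts."""
    return max(alphas)

# Two analyses over the same database, e.g. two phases of one algorithm.
two_phase_total = sequential_cost([0.1, 0.4])

# Nine noisy counts, one per disjoint cell, each using parameter 0.1.
cell_histogram_total = parallel_cost([0.1] * 9)

print(two_phase_total, cell_histogram_total)
```

This is why a histogram over disjoint cells can spend the full parameter on every cell, while consecutive passes over the same data must split the budget.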

Differential Privacy Interface. A privacy interface
such as PINQ [11] can be used to provide a differentially
private interface to a database. It provides
operators for database aggregate queries such as
count (NoisyCount) and sum (NoisySum), which use
Laplace noise and the exponential mechanism to
enforce differential privacy. It also provides a Partition
operator that can partition the dataset based on the
provided set of candidate keys. The Partition operator
takes advantage of parallel composition and thus the
privacy costs do not add up.

3.2 Data and Query Model 

Data Model. Consider a dataset with $N$ nominal or
discretized attributes. We use an $N$-dimensional data
cube, also called a base cuboid in the data warehousing
literature [26], ||2T|, to represent the aggregate
information of the data set. The records are the points
in the $N$-dimensional data space. Each cell of a data
cube represents an aggregated measure, in our case,
the count of the data points corresponding to the
multidimensional coordinates of the cell. We denote
the number of cells by $m$, with
$m = |dom(A_1)| \times \cdots \times |dom(A_N)|$, where $|dom(A_i)|$ is the domain size of
attribute $A_i$. We use the term "partition" to refer to
any sub-cube in the data cube.

Fig. 2. Running example: original data represented in
a relational table (left) and a 2-dimensional count cube
(right; axes: Age and Income)

Figure 2 shows an example relational dataset with
attributes age and income (left) and a two-dimensional
count data cube or histogram (right). The domain values
of age are 20-30, 30-40 and 40-50; the domain
values of income are 0-10K, 10K-20K and > 20K.
Each cell in the data cube represents the population
count corresponding to the age and income values.

Query Model. We consider linear counting queries
that compute a linear combination of the count values
in the data cube based on a query predicate. We can
represent the original data cube, e.g. the counts of all
cells, by an $m$-dimensional column vector $x$ shown
below.

$$x = [\,10 \;\; 21 \;\; 37 \;\; 20 \;\; 0 \;\; 0 \;\; 53 \;\; 0 \;\; 0\,]^T$$

Definition 3.3 (Linear query [24]): A linear query $Q$
can be represented as an $m$-dimensional vector
$Q = [q_1 \ldots q_m]$ with each $q_i \in \mathbb{R}$. The answer to a
linear query $Q$ on data vector $x$ is the vector product
$Qx = q_1 x_1 + \cdots + q_m x_m$.

In this paper, we consider counting queries with
boolean predicates so that each $q_i$ is a boolean variable
with value 0 or 1. The sensitivity of the counting
queries, based on Equation (1), is $S_Q = 1$. We denote
$s$ as the query range size or query size, which is the
number of cells contained in the query predicate, and
we have $s = |Q|$. For example, a query $Q_1$ asking for
the population count with age = [20, 30] and income
> 30k, corresponding to $x_1$, is shown as a query
vector in Figure 3. The size of this query is 1. Figure 3 also
shows the original answer of $Q_1$ and a perturbed
answer with Laplace noise that achieves $\alpha$-differential
privacy. We note that the techniques and proofs are
generalizable to real number query vectors.

$Q_1 = [\,1 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0\,]$
$Q_1 x = x_1$
$A_{Q_1} = Q_1 x + N(\alpha / S_{Q_1})$
$S_{Q_1} = 1$
$N(\alpha / S_{Q_1}) = N(\alpha / 1)$
$A_{Q_1} = 10 + N(\alpha)$

Fig. 3. Example: a linear counting query

Definition 3.4 (Query matrix [24]): A query matrix is
a collection of linear queries, arranged by rows to
form a $p \times m$ matrix.

Given a $p \times m$ query matrix $H$, the query answer
for $H$ is a length-$p$ column vector of query results,
which can be computed as the matrix product $Hx$.
For example, an $m \times m$ identity query matrix $I_m$ will
result in a length-$m$ column vector consisting of all
the cell counts in the original data vector $x$.
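
The query matrix view can be made concrete with a short sketch. The count vector below is an assumption read off the running example (the nine cell counts that also appear in the released cell histogram of Section 4.3), and `answer` is a hypothetical helper, not part of any library.

```python
def answer(H, x):
    """Evaluate every linear query (one row of H) on the count vector x."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]

# Cell counts of the running example (assumed ordering x1..x9).
x = [10, 21, 37, 20, 0, 0, 53, 0, 0]

# Q1 selects only the first cell, so its query size is 1.
q1 = [1, 0, 0, 0, 0, 0, 0, 0, 0]

# The m x m identity query matrix asks for every cell count.
identity = [[1 if i == j else 0 for j in range(len(x))] for i in range(len(x))]

# A range query covering cells x1..x3 has query size 3.
q_range = [1, 1, 1, 0, 0, 0, 0, 0, 0]

print(answer([q1], x))
print(answer(identity, x))
print(answer([q_range], x))
```

Stacking several query vectors as rows of `H` gives all of their answers in one product, which is the representation used later to analyze the released histograms.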

A data release algorithm, consisting of a sequence 
of designed queries using the differential privacy in- 
terface, can be represented as a query matrix. We will 
use this query matrix representation in the analysis of 
our algorithms. 

3.3 Utility Metrics 

We formally analyze the utility of the released data
by the notion of $(\epsilon, \delta)$-usefulness [9].

Definition 3.5 ($(\epsilon, \delta)$-usefulness [9]): A database
mechanism $A$ is $(\epsilon, \delta)$-useful for queries in class $C$ if
with probability $1 - \delta$, for every $Q \in C$ and every
database $D$ with $A(D) = \hat{D}$, $|Q(\hat{D}) - Q(D)| \le \epsilon$.

In this paper, we mainly focus on linear counting
queries to formally analyze the released histograms. 
We will discuss and experimentally show how the 
released histogram can be used to support other types 
of OLAP queries such as sum and average and other 
applications such as classification. 

3.4 Laplace Distribution Properties 

We include a general lemma on probability distributions
and a theorem on the statistical distribution of
the summation of multiple Laplace noises, which we
will use when analyzing the utility of our algorithms.

Lemma 3.1: If $Z \sim f(z)$, then $aZ \sim \frac{1}{a} f(z/a)$, where
$a$ is any positive constant.

Theorem 3.3 [27]: Let $f_n(z, \alpha)$ be the PDF of
$\sum_{i=1}^{n} N_i(\alpha)$, where $N_i(\alpha)$ are i.i.d. Laplace noise
$Lap(1/\alpha)$. Then

$$f_n(z, \alpha) = \frac{\alpha^n}{2^n \Gamma^2(n)} \exp(-\alpha |z|) \int_0^{\infty} v^{n-1} \left(|z| + \frac{v}{2\alpha}\right)^{n-1} e^{-v} \, dv \qquad (2)$$


4 Multidimensional Partitioning
4.1 Motivation and Overview 

For differentially private histogram release, a multidimensional
histogram on a set of attributes is constructed
by partitioning the data points into mutually
disjoint subsets called buckets or partitions. The counts
or frequencies in each bucket are then released. Any
access to the original database is conducted through the
differential privacy interface to guarantee differential
privacy. The histogram can then be used to answer
random counting queries and other types of queries.

The partitioning strategy will largely determine the
utility of the released histogram for arbitrary counting
queries. Each partition introduces a bounded Laplace
noise or perturbation error by the differential privacy
interface. If a query predicate covers multiple partitions,
the perturbation error is aggregated. If a query
predicate falls within a partition, the result has to be
estimated assuming a certain distribution of the data
points in the partition. The dominant approach in
the histogram literature is making the uniform distribution
assumption, where the frequencies of records in the
bucket are assumed to be the same and equal to the
average of the actual frequencies [28]. This introduces
an approximation error.

Example. We illustrate the errors and the impact of
different partitioning strategies through an example
shown in Figure 4. Consider the data in Figure 2. As
a baseline strategy, we could release a noisy count
for each of the cells. In a data-aware strategy, as if
we know the original data, the 4 cells $x_5, x_6, x_8, x_9$
can be grouped into one partition and we release a
single noisy count for the partition. Note that the noises
are independently generated for each cell or partition.
Because the sensitivity of the counting query is 1 and
the partitioning only requires parallel composition of
differential privacy, the magnitude of the noise in
the two approaches is the same. Consider a query,
count($x_5$), asking the count of data points in the
region $x_5$. For the baseline strategy, the query error
is $N$, which only consists of the perturbation error.
For the data-aware strategy, the best estimate for the
answer based on the uniform distribution assumption
is $0 + N/4$. So the query error is $N/4$. In this case,
the approximation error is 0 because the cells in the
partition are indeed uniform. If not, approximation
error will be introduced. In addition, the perturbation
error is also amortized among the cells. Clearly, the
data-aware strategy is desired in this case.

Fig. 4. Baseline strategy vs. data-aware strategy
(Baseline: each of the four cells is released as 0+N; the query Count($x_5$) is answered with 0+N, query error N.
Data-aware: the partition is released as a single 0+N; the estimated answer is 0+N/4, query error N/4.)
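
The error comparison in this example can be checked numerically. The following is a minimal simulation sketch under the assumptions of the example (true count of $x_5$ is 0, four uniform cells in the partition, sensitivity 1); the function names are hypothetical and the Laplace sampler uses only the standard library.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_abs_errors(alpha, trials, rng):
    """Average |error| for count(x5) under both strategies; the true count is 0."""
    baseline_total = 0.0
    data_aware_total = 0.0
    for _ in range(trials):
        # Baseline: x5 has its own noisy count, so the error is N.
        baseline_total += abs(0 + laplace_noise(1.0 / alpha, rng))
        # Data-aware: one noisy count for the 4-cell partition, shared
        # uniformly among its cells, so the error is N / 4.
        partition_count = 0 + laplace_noise(1.0 / alpha, rng)
        data_aware_total += abs(partition_count / 4)
    return baseline_total / trials, data_aware_total / trials

baseline_err, data_aware_err = mean_abs_errors(alpha=1.0, trials=20000, rng=random.Random(0))
print(baseline_err, data_aware_err)
```

With the same privacy parameter in both strategies, the simulated data-aware error comes out at roughly a quarter of the baseline error, matching the $N$ versus $N/4$ argument. The advantage disappears when the grouped cells are not uniform, since the approximation error is then no longer zero.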

In general, a finer-grained partitioning will introduce
smaller approximation errors but larger aggregated
perturbation errors. Finding the right balance
to minimize the overall error for a random query
workload is a key question. Not surprisingly, finding
the optimal multi-dimensional histogram, even without
the privacy constraints, is a challenging problem,
and optimal partitioning even in two dimensions is
NP-hard [29]. Motivated by the above example and
guided by the composition theorems, we summarize
our two design goals: 1) generate uniform or close
to uniform partitions so that the approximation error
within the partitions is minimized, essentially generating
a v-optimal histogram [30]; 2) carefully and
efficiently use the privacy budget to minimize the
perturbation error. In this paper, we first study the
most fine-grained cell-based partitioning as a baseline
strategy, which results in an equi-width histogram
and does not introduce approximation error but only
perturbation error. We then propose a 2-phase kd-tree
(k-dimensional tree) based partitioning strategy
that results in a v-optimal histogram and seeks to
minimize both the perturbation and approximation
error.



4.2 A Baseline Cell Partitioning Strategy 

A simple strategy is to partition the data based on the
domain and then release a noisy count for each cell,
which results in an equi-width cell histogram. Figure 5
illustrates this baseline strategy. The implementation is
quite simple, taking advantage of the Partition operator
followed by NoisyCount on each partition, as shown in
Algorithm 1.

Fig. 5. Baseline cell partitioning

Algorithm 1 Baseline cell partitioning algorithm

Require: $\alpha$: differential privacy budget

1. Partition the data based on all domains.

2. Release NoisyCount of each partition using privacy
parameter $\alpha$

Privacy Guarantee. We present the theorem below for
the cell partitioning algorithm, which can be derived
directly from the composability theorems.

Theorem 4.1: Algorithm 1 achieves $\alpha$-differential
privacy.

Proof: Because every cell is a disjoint subset of
the original database, according to Theorem 3.2, it is
$\alpha$-differentially private. □

Error Quantification. We present a lemma followed
by a theorem that states a formal utility guarantee of
cell-based partitioning for linear distributive queries.

Lemma 4.1: If $N_i$ ($i = 1 \ldots m$) is a set of random
variables i.i.d. from $Lap(b)$ with mean 0, given
$0 < \epsilon < 1$, the following holds:

$$\Pr\left[\sum_{i=1}^{m} |N_i| \le \epsilon\right] \ge 1 - m \cdot \exp\left(-\frac{\epsilon}{mb}\right) \qquad (3)$$

Proof: Let $\epsilon_1 = \epsilon/m$. Given the Laplace distribution,
we have

$$\Pr[|N_i| > \epsilon_1] = 2 \int_{\epsilon_1}^{\infty} \frac{1}{2b} \exp\left(-\frac{x}{b}\right) dx = e^{-\epsilon_1/b}$$

then

$$\Pr[|N_i| \le \epsilon_1] = 1 - \Pr[|N_i| > \epsilon_1] = 1 - e^{-\epsilon_1/b}$$

If each $|N_i| \le \epsilon_1$, we have $\sum_{i=1}^{m} |N_i| \le m \cdot \epsilon_1 = \epsilon$, so
we have

$$\Pr\left[\sum_{i=1}^{m} |N_i| \le \epsilon\right] \ge \Pr[|N_i| \le \epsilon_1]^m = \left(1 - e^{-\epsilon_1/b}\right)^m$$

Let $F(x) = (1 - x)^m + mx - 1$. The derivative of $F(x)$
is $F'(x) = -m(1 - x)^{m-1} + m = m(1 - (1 - x)^{m-1}) \ge 0$
when $0 < x < 1$. Note that $0 < e^{-\epsilon_1/b} < 1$, so
$F(e^{-\epsilon_1/b}) \ge F(0) = 0$. We get

$$\left(1 - e^{-\epsilon_1/b}\right)^m \ge 1 - m \cdot e^{-\epsilon_1/b}$$

Recall $\epsilon_1 = \epsilon/m$, we derive Equation (3). □

Theorem 4.2: The released $\hat{D}$ of Algorithm 1 maintains
$(\epsilon, \delta)$-usefulness for linear counting queries, if
$\alpha \ge \frac{m \ln(m/\delta)}{\epsilon}$, where $m$ is the number of cells in the
data cube.

Proof: Given original data $D$ represented as count
vector $x$, using the cell partitioning with the Laplace
mechanism, the released data $\hat{D}$ can be represented
as $y = x + N$, where $N$ is a length-$m$ column vector
of Laplace noises drawn from $Lap(b)$ with $b = 1/\alpha$.

Given a linear counting query $Q$ represented as $q$
with query size $s$ ($s \le m$), we have $Q(D) = qx$, and

$$Q(\hat{D}) = qy = qx + qN, \qquad |Q(\hat{D}) - Q(D)| = |qN| \le \sum_{i=1}^{m} |N_i|$$

With Lemma 4.1, we have

$$\Pr[|Q(\hat{D}) - Q(D)| \le \epsilon] \ge \Pr\left[\sum_{i=1}^{m} |N_i| \le \epsilon\right] \ge 1 - m \cdot \exp\left(-\frac{\epsilon}{mb}\right)$$

If $m \cdot \exp(-\frac{\epsilon}{mb}) \le \delta$, then

$$\Pr[|Q(\hat{D}) - Q(D)| \le \epsilon] \ge 1 - \delta$$

In order for $m \cdot \exp(-\frac{\epsilon}{mb}) \le \delta$, given $b = 1/\alpha$, we
derive the condition $\alpha \ge \frac{m \ln(m/\delta)}{\epsilon}$. □
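
The condition in Theorem 4.2 can be explored numerically. The sketch below is an illustration with hypothetical function names and illustrative parameter values: it computes the sufficient privacy parameter and estimates how often the summed noise magnitude exceeds $\epsilon$. It uses the fact that $|N_i|$ for $N_i \sim Lap(1/\alpha)$ is exponentially distributed with rate $\alpha$.

```python
import math
import random

def required_alpha(m, eps, delta):
    """Sufficient privacy parameter from Theorem 4.2: m * ln(m / delta) / eps."""
    return m * math.log(m / delta) / eps

def union_bound(m, eps, alpha):
    """Failure-probability bound of Lemma 4.1 with b = 1 / alpha."""
    return m * math.exp(-eps * alpha / m)

def failure_rate(m, eps, alpha, trials, rng):
    """Fraction of trials in which sum_i |N_i| exceeds eps, N_i ~ Lap(1/alpha)."""
    failures = 0
    for _ in range(trials):
        # |N_i| is exponential with rate alpha when N_i ~ Lap(1/alpha).
        total = sum(rng.expovariate(alpha) for _ in range(m))
        if total > eps:
            failures += 1
    return failures / trials

m, eps, delta = 9, 0.5, 0.1          # illustrative values only
alpha = required_alpha(m, eps, delta)
print(alpha, union_bound(m, eps, alpha), failure_rate(m, eps, alpha, 2000, random.Random(0)))
```

The required parameter grows linearly in the number of cells (up to the logarithmic factor), which is one way to see why the baseline strategy becomes costly for large data cubes. The bound is a worst-case guarantee, so the simulated failure rate is typically far below $\delta$.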
4.3 DPCube: Two-Phase Partitioning

Fig. 6. DPCube: 2-phase partitioning

We now present our DPCube algorithm. DPCube
uses an innovative two-phase partitioning strategy
as shown in Figure 6. First, a cell-based partitioning
based on the domains (not the data) is used to
generate a fine-grained equi-width cell histogram (as
in the baseline strategy), which gives an approximation
of the original data distribution. We generate a
synthetic database $D_c$ based on the cell histogram.
Second, a multi-dimensional kd-tree based partitioning
is performed on $D_c$ to obtain uniform or close to
uniform partitions. The resulting partitioning keys are
used to Partition the original database and obtain a
NoisyCount for each of the partitions, which results
in a v-optimal histogram [30]. Finally, given a user-issued
query, an estimation component uses either the
v-optimal histogram or both histograms to compute
an answer. The key innovation of our algorithm is
that it is data-aware or adaptive because the multi-dimensional
kd-tree based partitioning is based on the
cell histogram from the first phase and hence exploits
the underlying data distribution indirectly observed
through the perturbed cell histogram. Essentially, the
kd-tree is based on an approximate distribution of the
original data. The original database is not queried
during the kd-tree construction, which saves the
privacy budget. The overall privacy budget is
divided between the two phases and spent only on
querying the NoisyCount for cells and partitions.
Algorithm 2 presents a sketch of the algorithm.

The key step in Algorithm 2 is the multi-dimensional
partitioning step. It uses a kd-tree based
space partitioning strategy that seeks to produce close
to uniform partitions in order to minimize the
estimation error within a partition. It starts from the
root node, which covers the entire space. At each
step, a splitting dimension and a split value from
the range of the current partition on that dimension
are chosen heuristically to divide the space into
subspaces. The algorithm repeats until a pre-defined
requirement is met. In contrast to traditional kd-tree
index construction, which desires a balanced tree, our
main design goal is to generate uniform or close to
uniform partitions so that the approximation error
within the partitions is minimized. Thus we propose
a uniformity-based heuristic to decide
whether to split the current partition and to select
the best splitting point. There are several metrics
that can be used to measure the uniformity of a
partition, such as information entropy and variance.
In our experiments, we use variance as the metric.
Concretely, we do not split a partition if its variance is
below a threshold, i.e. it is close to uniform, and
split it otherwise. In order to select the best splitting
point, we choose the dimension with the largest range
and the splitting point that minimizes the cumulative
weighted variance of the resulting partitions. This
heuristic is consistent with the goal of a v-optimal
histogram, which places the histogram bucket boundaries
to minimize the cumulative weighted variance
of the buckets. Algorithm 3 describes step 3 of
Algorithm 2.
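
The two phases and the variance-driven split can be sketched for a small 2-dimensional count grid. This is an illustrative sketch under stated assumptions, not the authors' implementation: it does not use PINQ, all function names are hypothetical, the 3 x 3 layout of the running example is assumed to be row-major, population variance is used as the uniformity measure, and the split search simply tries every dimension and split point.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Lap(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def cells(grid, box):
    """Values of `grid` inside the half-open box (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = box
    return [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def kd_partition(grid, box, threshold):
    """Split `box` recursively while its variance exceeds `threshold`."""
    values = cells(grid, box)
    if len(values) == 1 or variance(values) <= threshold:
        return [box]                      # close to uniform: stop splitting
    r0, r1, c0, c1 = box
    splits = [((r0, k, c0, c1), (k, r1, c0, c1)) for k in range(r0 + 1, r1)]
    splits += [((r0, r1, c0, k), (r0, r1, k, c1)) for k in range(c0 + 1, c1)]

    def weighted_variance(pair):
        # Size-weighted sum of the variances of the two resulting parts.
        return sum(len(cells(grid, b)) * variance(cells(grid, b)) for b in pair)

    left, right = min(splits, key=weighted_variance)
    return kd_partition(grid, left, threshold) + kd_partition(grid, right, threshold)

def dpcube(grid, alpha1, alpha2, threshold, rng):
    """Two-phase release; the total privacy cost is alpha1 + alpha2."""
    rows, cols = len(grid), len(grid[0])
    # Phase I: noisy count per cell (disjoint cells, parameter alpha1).
    noisy_cells = [[v + laplace_noise(1.0 / alpha1, rng) for v in row] for row in grid]
    # The kd-tree is built on the noisy cells only; the original data is not read.
    boxes = kd_partition(noisy_cells, (0, rows, 0, cols), threshold)
    # Phase II: noisy count per partition of the original data (parameter alpha2).
    released = [(b, sum(cells(grid, b)) + laplace_noise(1.0 / alpha2, rng)) for b in boxes]
    return noisy_cells, released

grid = [[10, 21, 37], [20, 0, 0], [53, 0, 0]]   # assumed layout of x1..x9
noisy_cells, released = dpcube(grid, alpha1=0.5, alpha2=0.5, threshold=50.0,
                               rng=random.Random(0))
for box, count in released:
    print(box, round(count, 2))
```

The variance threshold controls the trade-off discussed above: a low threshold yields many small partitions (less approximation error, more aggregated noise), while a high threshold yields few large ones.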

Privacy Guarantee. We present the theorem below
for the 2-phase partitioning algorithm, which can be
derived directly from the composability theorems.
Theorem 4.3: Algorithm 2 is $\alpha$-differentially private.
Proof: Step 2 and Step 5 are $\alpha_1$- and $\alpha_2$-differentially
private, respectively. So the sequence is $\alpha$-differentially private
because of Theorem 3.1, with $\alpha = \alpha_1 + \alpha_2$. □



Query Matrix Representation. We now illustrate how 
the proposed algorithm can be represented as a query 
matrix. We denote H as the query matrix generating 
the released data in our algorithm and we have H = 
[Hii;Hi], where Hi and Hn correspond to the query 
matrix in the cell partitioning and kd-tree partitioning 



Algorithm 2 2-phase partitioning algorithm

Require: $\beta$: number of cells;
         $\alpha$: the overall privacy budget
Phase I:
  1. Partition the original database based on all domains.
  2. Get NoisyCount of each partition using privacy
     parameter $\alpha_1$ and generate a synthetic dataset $D_c$.
Phase II:
  3. Partition $D_c$ by Algorithm 3.
  4. Partition the original database based on the
     partition keys returned from step 3.
  5. Release NoisyCount of each partition using
     privacy parameter $\alpha_2 = \alpha - \alpha_1$.



Algorithm 3 Kd-tree based v-optimal partitioning

Require: $D_t$: input database; $\xi_0$: variance threshold;
  if variance of $D_t$ > $\xi_0$ then
    Find a dimension and splitting point $m$ which
    minimizes the cumulative weighted variance of
    the two resulting partitions;
    Split $D_t$ into $D_{t1}$ and $D_{t2}$ by $m$.
    Partition $D_{t1}$ and $D_{t2}$ by Algorithm 3.
  end if



phase respectively. $H_I$ is an identity matrix with one
row per cell, each row querying the count of one cell. $H_{II}$
contains all the partitions generated by the second
phase. We use $N(\alpha)$ to denote the column noise vector
and each noise $N_i$ is determined by a differential
privacy parameter ($\alpha_1$ in the first phase and $\alpha_2$ in
the second phase respectively). The released data is
$y = Hx + N$. It consists of the cell histogram $y_I$ in
phase I and the v-optimal histogram $y_{II}$ in phase II:
$y_I = H_I x + N(\alpha_1)$, $y_{II} = H_{II} x + N(\alpha_2)$.
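As a rough illustration of the query-matrix view $y = Hx + N$, the following Python sketch builds $H_I$ and $H_{II}$ for a flat vector of cell counts and perturbs both answers with Laplace noise. The function names, the flat cell indexing, and the noise scale $1/\alpha$ (which assumes each count query has sensitivity 1) are assumptions for the example. The partitions are passed in as an argument here, whereas in the actual algorithm they must be derived from the noisy Phase I output.

```python
import numpy as np

def noisy_counts(true_counts, alpha, rng):
    """Laplace mechanism for counts; scale 1/alpha assumes sensitivity 1."""
    return true_counts + rng.laplace(scale=1.0 / alpha, size=np.shape(true_counts))

def release(x, partitions, alpha, alpha1, rng):
    """Return (H_I, H_II, y_I, y_II) for cell counts `x`.

    `partitions` is a list of lists of cell indices (hypothetical layout)
    that together cover every cell exactly once."""
    if not 0 < alpha1 < alpha:
        raise ValueError("alpha1 must lie strictly between 0 and alpha")
    n = len(x)
    H_I = np.eye(n)                          # one row per cell
    H_II = np.zeros((len(partitions), n))    # one row per partition
    for row, cells in enumerate(partitions):
        H_II[row, cells] = 1.0
    alpha2 = alpha - alpha1                  # remaining budget for Phase II
    y_I = noisy_counts(H_I @ x, alpha1, rng)
    y_II = noisy_counts(H_II @ x, alpha2, rng)
    return H_I, H_II, y_I, y_II
```

Each row of `H_II` is simply the indicator vector of one partition, so `H_II @ x` gives the exact partition counts before noise is added.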

Using our example data from Figure 2, the query
matrix $H$ consisting of $H_I$ and $H_{II}$ and the released
data consisting of the cell histogram $y_I$ and subcube
histogram $y_{II}$ are shown in Equations (4) and (5). The
histograms are also illustrated in Figure 7.



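\[
H_I = I_9, \qquad
y_I = \begin{bmatrix}
10 + N_1(\alpha_1) \\
21 + N_2(\alpha_1) \\
37 + N_3(\alpha_1) \\
20 + N_4(\alpha_1) \\
+ N_5(\alpha_1) \\
+ N_6(\alpha_1) \\
53 + N_7(\alpha_1) \\
+ N_8(\alpha_1) \\
+ N_9(\alpha_1)
\end{bmatrix}
\tag{4}
\]

\[
H_{II} = [\,\cdots\,], \qquad
y_{II} = \begin{bmatrix}
51 + N_{10}(\alpha_2) \\
37 + N_{11}(\alpha_2) \\
53 + N_{12}(\alpha_2) \\
+ N_{13}(\alpha_2)
\end{bmatrix}
\tag{5}
\]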



4.4 Query Estimation and Error Quantification 

Once the histograms are released, given a random
user query, the estimation component will compute
an answer using the released histograms. We study
two techniques and formally quantify and compare
the errors of the query answers they generate. The
first one estimates the query answer using only the
subcube histogram, assuming a uniform distribution
within each partition. As an alternative approach, we
adapt the least squares (LS) method, also used in [24],
[23], to our two-phase strategy. The basic idea is to use
both the cell histogram and the subcube histogram and
find an approximate solution of the cell counts to resolve
the inconsistency between the two histograms.

Fig. 7. Example: released cell histogram (left) and subcube histogram (right)

Fig. 8. Example: query estimation with uniform estimation using subcube histogram

Uniform Estimation using Subcube Histogram.

Based on our design goal to obtain uniform partitions
for the subcube histogram, we make the uniform
distribution assumption in the subcube histogram, where
the counts in each cell within a partition are assumed
to be the same. Given a partition $p$, we denote $n_p$ as
the size of the partition in number of cells. Hence the
count of each cell within this partition is $y_p/n_p$, where
$y_p$ is the noisy count of the partition. We denote $x_H$
as the estimated cell counts of the original data. If a
query predicate falls within one single partition, the
estimated answer for that query is $\frac{s}{n_p} y_p$, where $s$ is
the query range size. Given a random linear query
that spans multiple partitions, we can add up the
estimated counts of all the partitions within the query
predicate. In the rest of the error analysis, we only
consider queries within one partition as the result can
be easily extended for the general case.
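The uniform estimation rule above can be written as a short Python function. This is a sketch under assumed data structures: cells are identified by integer indices, each partition is a list of such indices, and the name `uniform_estimate` is hypothetical.

```python
def uniform_estimate(query_cells, partitions, noisy_partition_counts):
    """Answer a counting query from the subcube histogram alone.

    For every partition, the query receives the fraction s / n_p of the
    noisy partition count y_p, where s is the number of the partition's
    cells covered by the query and n_p is the partition size."""
    query = set(query_cells)
    answer = 0.0
    for cells, y_p in zip(partitions, noisy_partition_counts):
        s = len(query.intersection(cells))
        answer += s / len(cells) * y_p
    return answer
```

For instance, with partitions `[[0, 1, 2, 3], [4, 5]]`, noisy counts `[40.0, 10.0]` and a query covering cells `[1, 2, 4]`, the estimate is (2/4)·40 + (1/2)·10 = 25.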

Figure 8 shows an example. Given a query $Q_2$ on
population count with age = [30, 40], the original
answer is $x_2 + x_5 + x_8$. Using the subcube histogram,
the query overlaps with partition $y_{10}$ and partition $y_{13}$.
Using the uniform estimation, the estimated answer
is the sum of the estimated counts contributed by these
two partitions.

Error Quantification for Uniform Estimation. We
derive the $(\epsilon, \delta)$-usefulness and the expected error for
the uniform estimation method. To formally understand
the distribution properties of the input database
that guarantee the optimality of the method, we first
define the smoothness of the distribution within a
partition or database, similar to [31]. The difference
is that [31] defines only the upper bound, while our
definition below defines the bound of differences of
all cell counts. Intuitively, the smaller the difference,
the smoother the distribution.

Definition 4.1 ($\gamma$-smoothness): Denote $x$ the original
counts of cells in one partition. If $\forall x_i, x_j \in x$,
$|x_j - x_i| \le \gamma$, then the distribution in the partition
satisfies $\gamma$-smoothness.

We now present a theorem that formally analyzes
the utility of the released data if the input database
satisfies $\gamma$-smoothness, followed by a theorem for the
general case.

Theorem 4.4: For $\gamma$-smooth data $x$, given a query $q$
with size $s$ and $n_p$ the size of the partition, the uniform
estimation method is $(\epsilon, \delta)$-useful if equation (6) holds;
the upper bound of the expected absolute error $E(\epsilon_H)$
is shown in equation (7).

\[
\gamma \le \left(\epsilon + \frac{s \log \delta}{\alpha_2 n_p}\right) \Big/ \min(s, n_p - s)
\tag{6}
\]

\[
E(\epsilon_H) \le \gamma\,[\min(s, n_p - s)] + \frac{s}{\alpha_2 n_p}
\tag{7}
\]



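Proof: Given a query vector $Q$, the answer using
the original data is $Qx$, and the estimated answer using
the released partition is $\frac{s}{n_p} y_p$, where $y_p$ is the released
partition count, $n_p$ is the partition size, and $s$ is the
query size. The released count is $y_p = \sum_{i=1}^{n_p} x_i + N(\alpha_2)$.
So we have the absolute error as

\[
\epsilon_H = \left|\frac{s}{n_p} y_p - Qx\right|
= \left|\frac{s}{n_p} N(\alpha_2) + \left(\frac{s}{n_p}\sum_{i=1}^{n_p} x_i - Qx\right)\right|
\]

According to the definition of $\gamma$-smoothness,
$\forall i, j$, $|x_j - x_i| \le \gamma$, then
$\left|\frac{s}{n_p}\sum_{i=1}^{n_p} x_i - Qx\right| \le \min(s, n_p - s)\gamma$.
Because of the symmetry of the PDF of $N(\alpha_2)$, we have

\[
\epsilon_H \le \left|\min(s, n_p - s)\gamma + \frac{s}{n_p} N(\alpha_2)\right|
\le \min(s, n_p - s)\gamma + \left|\frac{s}{n_p} N(\alpha_2)\right|
\]

By Lemma 3.1, we know the PDF of $\frac{s}{n_p} N(\alpha_2)$. To
satisfy $(\epsilon, \delta)$-usefulness, we require
$\Pr(\epsilon_H \le \epsilon) \ge 1 - \delta$,
then we derive the condition as in equation (6).
The expected absolute error is

\[
E(\epsilon_H) \le \left|\frac{s}{n_p}\sum_{i=1}^{n_p} x_i - Qx\right| + \frac{s}{n_p} E|N(\alpha_2)|
\le \min(s, n_p - s)\gamma + \frac{s}{n_p} E|N(\alpha_2)|
\]

Based on the PDF of $\frac{s}{n_p} N(\alpha_2)$, we have

\[
E|N(\alpha_2)| = 1/\alpha_2
\]

and hence derive equation (7). □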
From Theorem 4.4 we can conclude that if the input
data is smoothly distributed or very sparse, $\gamma$ would
be small and hence the error would be small. In this case,
our algorithm achieves the best result. In the general case,
if we do not know the distribution properties of the
input data, we present Theorem 4.5 to quantify the
error.
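The bound of Theorem 4.4 can be checked numerically. The sketch below assumes the bound has the form $\gamma \min(s, n_p - s) + s/(\alpha_2 n_p)$, that the partition count is perturbed with Laplace noise of scale $1/\alpha_2$, and that the query covers the first $s$ cells of the partition. The function names are hypothetical.

```python
import numpy as np

def smooth_error_bound(gamma, s, n_p, alpha2):
    """Upper bound on the expected absolute error for gamma-smooth data."""
    return gamma * min(s, n_p - s) + s / (alpha2 * n_p)

def simulated_uniform_error(x, s, alpha2, rng, trials=20000):
    """Monte Carlo estimate of E|(s/n_p) * y_p - Qx| for a query that
    covers the first s cells of a partition with cell counts x."""
    n_p = len(x)
    noise = rng.laplace(scale=1.0 / alpha2, size=trials)   # N(alpha2)
    estimates = s / n_p * (x.sum() + noise)                # (s/n_p) * y_p
    return np.abs(estimates - x[:s].sum()).mean()
```

The bound is loose when the cell counts happen to be nearly balanced between the query range and the rest of the partition, since the smoothness term then overstates the approximation error.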

Theorem 4.5: Given a linear counting query $Q$, the
expected absolute error for the uniform estimation
method, $E(\epsilon_H)$, is a function of $(\alpha_1, s, \eta)$ shown in
equation (8):

\[
E(\epsilon_H) = \int f_s(z, \alpha_1)\,|\eta + z|\,dz
\tag{8}
\]

where $\eta = \frac{s}{n_p} y_p - Q y_I$, $y_p$ is the released count of
the partition in the subcube histogram, $n_p$ is the size of
the partition, $s$ is the size of the query, and $y_I$ is the
released counts of the cell histogram.

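Proof: Given a query $Q$, the answer using the
original data is $Qx$, and the estimated answer using the
released partition is $\frac{s}{n_p} y_p$. The released partition count
is $y_p = \sum_{i=1}^{n_p} x_i + N(\alpha_2)$. The released cell counts for
the cell histogram in the first phase are $y_I = x + N(\alpha_1)$.
So we have

\[
\epsilon_H = \left|\frac{s}{n_p} y_p - Qx\right|
= \left|\frac{s}{n_p} y_p - Q\,(y_I - N(\alpha_1))\right|
= \left|\left(\frac{s}{n_p} y_p - Q y_I\right) + \sum_{i=1}^{s} N_i(\alpha_1)\right|
\]

Denote $\eta = \frac{s}{n_p} y_p - Q y_I$, which is the difference
(inconsistency) between the estimated answers using
the cell histogram and the subcube histogram, then
$\epsilon_H = |\eta + \sum_{i=1}^{s} N_i(\alpha_1)|$. By equation (2), we have
$E(\epsilon_H)$ in equation (8). □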
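Because the integral in Theorem 4.5 involves the density of a sum of $s$ Laplace variables, it is convenient to approximate it by sampling. The sketch below is an assumption-laden illustration rather than the simulation technique used in the paper: it draws the $s$ noise terms with scale $1/\alpha_1$, adds the observed inconsistency $\eta$, and averages the absolute value. The function name is hypothetical.

```python
import numpy as np

def expected_error_general(eta, s, alpha1, rng, trials=100000):
    """Monte Carlo estimate of E|eta + sum_{i=1..s} N_i(alpha1)|,
    where each N_i is Laplace noise with scale 1/alpha1."""
    z = rng.laplace(scale=1.0 / alpha1, size=(trials, s)).sum(axis=1)
    return np.abs(eta + z).mean()
```

With `s = 1` and `eta = 0` the estimate should be close to `1 / alpha1`, the mean absolute value of a single Laplace variable, which gives a quick sanity check of the sampler.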

Least Square Estimation using Cell Histogram and
Subcube Histogram. The least square (LS) method,
used in [24], [23], finds an approximate (least square)
solution of the cell counts that aims to resolve the
inconsistency between multiple differentially private
views. In our case, the cell histogram and the subcube
histogram provide two views. We derive the theoretical
error of the least square method and compare it
with the uniform estimation method. Note that the
error quantification in [24] is only applicable to the
case when $\alpha_1$ and $\alpha_2$ are equal. We derive a new result
for the general case in which $\alpha_1$ and $\alpha_2$ may have
different values.

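Theorem 4.6: Given a query $Q$ and a least square
estimation $x_{LS}$ based on the cell histogram and subcube
histogram, the expected absolute error of the query
answer, $E(\epsilon_{LS})$, is a function of $(\alpha_1, \alpha_2, n_p, s)$ in
equation (9), where $s$ is the size of $Q$, and $n_p$ is the size of
the partition.

\[
E(\epsilon_{LS}) = \frac{(n_p + 1)^3}{s^2 (n_p + 1 - s)}
\iiint |\epsilon|\,
f_{n_p - s}\!\left(\frac{(\epsilon - z)(n_p + 1)}{s}, \alpha_1\right)
f_s\!\left(\frac{(z - y)(n_p + 1)}{n_p + 1 - s}, \alpha_1\right)
f_1\!\left(\frac{y (n_p + 1)}{s}, \alpha_2\right)
dy\,dz\,d\epsilon
\tag{9}
\]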



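Proof: Given our two phase query strategy, the
query matrix for the partition is $H = [\mathrm{ones}(1, n_p); I_{n_p}]$,
where $n_p$ is the partition size. Using the least square
method in [24], we solve $H^+ = (H^T H)^{-1} H^T$.
We compute the least square estimation based on the
released data $y$ as $x_{LS} = H^+ y$. So the query answer
using the estimation is

\[
Q x_{LS} = Q H^+ y = Q H^+ (Hx + N(\alpha)) = Qx + Q H^+ N(\alpha)
\]

The absolute error is

\[
\epsilon_{LS} = |Q x_{LS} - Qx| = \left|
\frac{s}{n_p + 1} N(\alpha_2)
+ \frac{n_p + 1 - s}{n_p + 1} \sum_{i=1}^{s} N_i(\alpha_1)
- \frac{s}{n_p + 1} \sum_{i=1}^{n_p - s} N_i(\alpha_1)
\right|
\]

where the first sum ranges over the $s$ cells covered by the
query and the second over the remaining $n_p - s$ cells.
By Lemma 3.1 and equation (2), we know the
PDF of $\frac{s}{n_p + 1} N(\alpha_2)$, $\frac{n_p + 1 - s}{n_p + 1} \sum_{i=1}^{s} N_i(\alpha_1)$ and
$\frac{s}{n_p + 1} \sum_{i=1}^{n_p - s} N_i(\alpha_1)$; then by the convolution formula,
we have equation (9). □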
We will plot the above theoretical results for both
uniform estimation and least square estimation with
varying parameters in Section 6 and demonstrate the
benefit of the uniform estimation method, especially when
data is smoothly distributed.
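For a single partition, the least square estimate can be computed directly with a pseudo-inverse. The following Python sketch assumes the query matrix has the form described in the proof (one all-ones row for the partition count stacked on an identity matrix for the cell counts) and uses an ordinary, unweighted least square fit. The function name `ls_estimate` is hypothetical.

```python
import numpy as np

def ls_estimate(y_p, y_cells):
    """Least square cell estimates from one noisy partition count `y_p`
    and the noisy counts `y_cells` of the cells inside that partition."""
    y_cells = np.asarray(y_cells, dtype=float)
    n_p = len(y_cells)
    H = np.vstack([np.ones((1, n_p)), np.eye(n_p)])   # [ones(1, n_p); I]
    y = np.concatenate([[float(y_p)], y_cells])
    return np.linalg.pinv(H) @ y                       # x_LS = H^+ y
```

When the two views agree (the partition count equals the sum of the cell counts), the estimate returns the cell counts unchanged; otherwise the disagreement is spread evenly over the cells.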



5 Applications 

Having presented the multidimensional partitioning 
approach for differentially private histogram release, 
we now briefly discuss the applications that the re- 
leased histogram can support. 

OLAP. On-line analytical processing (OLAP) is a key 
technology for business-intelligence applications. The 
computation of multidimensional aggregates, such as 
count, sum, max, average, is the essence of on-line 
analytical processing. The released histograms with 
the estimation can be used to answer most common 
OLAP aggregate queries.

Classification. Classification is a common data analysis
task. Several recent works studied classification
with differential privacy by designing classifier-dependent
algorithms (such as decision tree) [32], [33].
The released histograms proposed in this paper can
also be used as training data for training a classifier,
offering a classifier-independent approach for
classification with differential privacy. To compare the
approach with existing solutions [32], [33], we have
chosen an ID3 tree classifier algorithm. However, the
histograms can be used as training data for any other
classifier.

Blocking for Record Linkage. Private record linkage 
between datasets owned by distinct parties is another 
common and challenging problem. In many situa- 
tions, uniquely identifying information may not be 
available and linkage is performed based on matching 
of other information, such as age, occupation, etc. 
Privacy preserving record linkage allows two parties
to identify the records that represent the same
real world entities without disclosing additional
information other than the matching result. While
Secure Multi-party Computation (SMC) protocols can be
used to provide strong security and perfect accuracy,
they incur prohibitive computational and communication
cost in practice. Typical SMC protocols require
$O(n \cdot m)$ cryptographic operations, where $n$ and $m$
correspond to the numbers of records in the two datasets.
In order to improve the efficiency of record linkage,
blocking is generally in use according to [34]. The
purpose of blocking is to divide a dataset into mutually
exclusive blocks assuming no matches occur across
different blocks. It reduces the number of record pairs
to compare for the SMC protocols, but meanwhile
we need to devise a blocking scheme that provides
a strong privacy guarantee. [19] has proposed a
hybrid approach with a differentially private blocking
step followed by an SMC step. The differentially
private blocking step adopts tree-structured space 
partitioning techniques and uses Laplace noise at 
each partitioning step to preserve differential privacy. 
The matching blocks are then sent to SMC protocols 
for further matching of the record pairs within the 
matching blocks which significantly reduces the total 
number of pairs that need to be matched by SMC. 

The released histograms in this paper can also be
used to perform blocking which enjoys differential
privacy. Our experiments show that we can achieve
approximately optimal reduction ratio of the pairs
that need to be matched by SMC with appropriate
setting for the threshold $\xi_0$. We will examine some
empirical results in Section 6.

6 Experiment 

We first simulate and compare the theoretical query
error results from Section 4.4 for counting queries that
fall within one partition to show the properties and
benefits of our approach (Section 6.1). We then present
a set of experimental evaluations of the quality of
the released histogram in terms of weighted variance
(Section 6.2), followed by evaluations of query error
against random linear counting queries and a
comparison with existing solutions (Section 6.3). Finally,
we also implemented and experimentally evaluated
the two additional applications using the released
histograms, classification and blocking for record
linkage, and show the benefit of our approach (Section 6.4).

6.1 Simulation Plots of Theoretical Results

As some of the theoretical results in Section 4.4
consist of equations that are difficult to compute for
given parameters, we use the simulation approach to
show the results and the impact of different parameters
in this section. Detailed and sophisticated techniques
about the simulation approach can be found
in [35] and [36]. Table 2 shows the parameters used
in the simulation experiment and their default values.

TABLE 2
Simulation parameters

parameter               description                                        default value
$n_p$                   partition size                                     $n_p = 11$
$s$                     query size                                         $s \le n_p$
$\alpha_1, \alpha_2$    diff. privacy parameters                           $\alpha_1 = 0.05$, $\alpha_2 = 0.15$
$\gamma$                smoothness parameter                               $\gamma = 5$
$\eta$                  inconsistency between cell and subcube histogram   $\eta = 5$



Metric. We evaluate the error of absolute count
queries. Recall that $E(\epsilon_{LS})$ is the expected error of
the least square estimation method; $\max(E(\epsilon_H))$ is the
upper bound of expected error of the uniform estimation
method when the data is $\gamma$-smooth; $E(\epsilon_H)$ is
the expected error of the uniform estimation method
in the general case. Note that $E(\epsilon_{LS})$, $\max(E(\epsilon_H))$,
and $E(\epsilon_H)$ are derived from equations (9), (7), and (8)
respectively.

Impact of Query Size. We first study the impact
of the query size. Figure 9 shows the error of the
uniform and least square solutions with respect to
varying query size for $\gamma$-smooth data and general case
(any distribution) respectively. We can see that the
highest error appears when the query size $s$ is half
of the partition size $n_p$. When $\gamma$ is small, i.e. data is
smoothly distributed, the uniform estimation method
outperforms the least square method. In the general
case when we do not have any domain knowledge
about the data distribution, it is beneficial to use both
cell histogram and subcube histogram to resolve their
inconsistencies.

Fig. 9. Query error vs. query size $s$: (a) $\gamma$-smooth distribution, (b) any distribution

Impact of Privacy Budget Allocation. We now study
the impact of the allocation of the overall differential
privacy budget $\alpha$ between the two phases. The
overall budget is fixed with $\alpha = 0.2$. Figure 10 shows
the error of the uniform and least square solutions
with respect to varying privacy budget allocation $\alpha_1$
for $\gamma$-smooth data and general case (any distribution)
respectively. For the LS method, equally dividing $\alpha$
between the two phases or slightly more for cell-based
partitioning works better than other cases. The results
for the uniform estimation method present interesting
trends. For smoothly distributed data, a smaller privacy
budget for the partitioning phase yields better results.
Intuitively, since data is smoothly distributed, it is
beneficial to save the privacy budget for the second
phase to get a more accurate overall partition count.
On the contrary, for the random case, it is beneficial
to spend the privacy budget in the first phase to get
more accurate cell counts, and hence more accurate
partitioning.

Fig. 10. Query error vs. privacy budget allocation $\alpha_1$: (a) $\gamma$-smooth distribution, (b) any distribution

Impact of Partition Size. For $\gamma$-smooth data, the
expected error is dependent on the partition size $n_p$
and the smoothness level $\gamma$. Figure 11(a) shows the
error of the uniform and least square solutions with
respect to varying partition size $n_p$ for $\gamma$-smooth data.
We observe that the error increases when the partition
size increases because of the increasing approximation
error within the partition. Therefore, a good partitioning
algorithm should avoid large partitions.

Impact of Data Smoothness. For $\gamma$-smooth data,
we study the impact of the smoothness level $\gamma$ on
the error bound for the uniform estimation method.
Figure 11(b) shows the maximum error bound with
varying $\gamma$. We can see that the smoother the data,
the smaller the error for the released data. Note that the
query size $s$ is set to be nearly half of the partition
size $n_p$ as default, which magnifies the impact of the
smoothness. We observed in other parameter settings
that the error increases only slowly for queries with
small or big query sizes.

Fig. 11. Query error for $\gamma$-smooth distribution: (a) vs. partition size $n_p$, (b) vs. smoothness $\gamma$

Impact of Inconsistency. For data with any (unknown)
distribution, $E(\epsilon_H)$ is a function of $\eta$, the level
of inconsistency between the cell histogram and the
subcube histogram. In contrast to the $\gamma$-smooth
case in which we have prior knowledge about
the smoothness of the data, here we only have this
observed level of inconsistency from the released data
which reflects the smoothness of the original data.
Figure 12 shows the error for the uniform estimation
method with varying $\eta$. Note that the error increases
with increasing $\eta$, and even when $\eta = 10$, the error
is still larger than the error of the least square method in
Figure 11.

Fig. 12. Query error vs. level of inconsistency $\eta$ for any distribution



6.2 Histogram Variance 

We use the Adult dataset from the Census [37].
All experiments were run on a computer with an Intel
P8600 (2 × 2.4 GHz) CPU and 2 GB memory. For
computation simplicity and smoothness of data distribution,
we only use the first 10^ data records. 

Original and Released Histograms. We first present
some example histograms generated by our algorithm
for the Census dataset. Figure 13(a) shows the original
2-dimensional histogram on Age and Income. Figure
13(b) shows a cell histogram generated in the first
phase with $\alpha_1 = 0.05$. Figure 13(c) shows a subcube
histogram generated in the second phase with estimated
cell counts using uniform estimation, in which
each horizontal plane represents one partition. Figure
13(d) shows an estimated cell histogram using the LS
estimation using both cell histogram and subcube
histogram. We systematically evaluate the utility of the
released subcube histogram with uniform estimation
below.

Fig. 13. Original and released histograms: (a) original histogram, (b) cell histogram, (c) subcube histogram, (d) estimated cell histogram

Metric. We now evaluate the quality of the released
histogram using an application-independent metric,
the weighted variance of the released subcube
histogram. Ideally, our goal is to obtain a v-optimal
histogram which minimizes the weighted variance so
as to minimize the estimation error within partitions.
Formally, the weighted variance of a histogram is
defined as $V = \sum_{i=1}^{p} x_i V_i$, where $p$ is the number
of partitions, $x_i$ is the number of data points in
the $i$-th partition, and $V_i$ is the variance in the $i$-th
partition [30].
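The weighted variance metric, the sum over partitions of (number of data points) × (variance of the cell counts in the partition), is straightforward to compute. The sketch below assumes cells are stored in a flat array and each partition is given as a list of cell indices. The function name is hypothetical, and the population variance is used for each partition.

```python
import numpy as np

def weighted_variance(cell_counts, partitions):
    """Sum over partitions of x_i * V_i, where x_i is the number of data
    points in partition i and V_i is the variance of its cell counts."""
    counts = np.asarray(cell_counts, dtype=float)
    total = 0.0
    for cells in partitions:
        block = counts[cells]
        total += block.sum() * block.var()
    return total
```

A partitioning that groups only equal-valued cells yields a weighted variance of zero, which is the ideal case for uniform estimation.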



Impact of Variance Threshold. Figure 14(a) shows the
weighted variance of the released subcube histogram
with respect to varying variance threshold value $\xi_0$
used in our algorithm. As expected, when $\xi_0$ becomes
large, less partitioning is performed, i.e. more data
points are grouped into one bucket, which increases
the variance.

Impact of Privacy Budget Allocation. Figure 14(b)
shows the weighted variance with respect to varying
privacy budget in the first phase $\alpha_1$ (with fixed $\xi_0$).
Note that the overall privacy budget is fixed. We see
that the correlation is not very clear due to the
randomness of noises. Also, $\xi_0$ is fixed, which does not
reflect the different magnitude of noise introduced by
different $\alpha_1$. Generally speaking, the variance should
decrease gradually when $\alpha_1$ increases because a large
$\alpha_1$ will introduce less noise to the first phase, which
will lead to better partitioning quality.

Fig. 14. Histogram variance and impact of parameters: (a) vs. threshold $\xi_0$, (b) vs. privacy budget $\alpha_1$

6.3 Query Error

We evaluate the quality of the released histogram
using random linear counting queries and measure
the average absolute query error and compare the
results with other algorithms. We use the Age and
Income attributes and generated 10^ random counting
queries to calculate the average query error. The
random queries consist of two range predicates on
the two attributes respectively. We also implemented
an alternative kd-tree strategy similar to that used in
[19], referred to as hierarchical kd-tree, and another
data release algorithm [23], referred to as consistency
check, for comparison purposes.

Fig. 15. Query error and impact of parameters: (a) vs. threshold $\xi_0$, (b) vs. privacy budget $\alpha_1$



Impact of Variance Threshold. We first evaluate the
impact of the threshold value $\xi_0$ on query error. Figure
15(a) shows the average absolute query error with
respect to varying threshold values. Consistent with
Figure 14, we observe that the query error increases
with an increasing threshold value, due to the
increased variance within partitions.

Impact of Privacy Budget Allocation. We next evaluate
the impact of how to allocate the overall privacy
budget $\alpha$ into $\alpha_1$ and $\alpha_2$ for the two phases in the
algorithm. Figure 15(b) shows the average absolute
query error vs. varying $\alpha_1$. We observe that small $\alpha_1$
values yield better results, which complies with our
theoretical result for $\gamma$-smooth data in Figure 10. This
verifies that the real life Adult dataset has a somewhat
smooth distribution. On the other hand, for data with
unknown distribution, we expect that $\alpha_1$ cannot be too
small, as it is more important for generating a more
accurate partitioning.

Comparison with Other Works. We compare our
approach with two representative approaches in
existing works, the hierarchical kd-tree strategy used in
[19], and the hierarchical partitioning with consistency
check in [23]. Figure 16 shows the average absolute
query error of different approaches with respect
to varying privacy budget $\alpha$. We can see that the
DPCube algorithm achieves the best utility against the
random query workload because of its efficient 2-phase
use of the privacy budget and the v-optimal
histogram.



Fig. 16. Query error for different approaches: (a) vs. privacy budget $\alpha$, (b) vs. query size

We also experimented with query workloads with
different query sizes for the DPCube approach and the
consistency check approach to closely examine their
differences. Figure 16(b) shows the average absolute
query error with respect to varying query sizes for
the two approaches. Note that the query size is in log
scale. As shown in our theoretical result in Figure
11, the query error of our algorithm would increase
with increasing query size. We can see that for queries
with reasonable sizes, our 2-phase algorithm achieves
better results than the consistency check algorithm.
On the other hand, the consistency check algorithm is
beneficial for large size queries because our algorithm
favors smaller partitions with small variances, which
will result in large aggregated perturbation errors for
large queries that span multiple partitions.

6.4 Additional Applications

Classification. We evaluate the utility of the released
histogram for classification and compare it with other
differentially private classification algorithms. In this
experiment, the dataset is divided into training and
test subsets with 30162 and 15060 records respectively.
We use the work class, marital status, race, and sex
attributes as features. The class was represented by the
salary attribute with two possible values indicating if
the salary is greater than $50k or not.

For this experiment, we compare several classifiers.
As a baseline, we trained an ID3 classifier from the
original data using Weka [38]. We also adapted the Weka
ID3 implementation such that it can use a histogram as
its input. To test the utility of the differentially private
histogram generated from our algorithm, we used
it to train an ID3 classifier called DPCube histogram
ID3. As a comparison, we implemented an interactive
differentially private ID3 classifier, private interactive
ID3, introduced by Friedman et al. [33].

Fig. 17. Classification accuracy vs. privacy budget $\alpha$

Figure 17 shows the classification accuracy of the
different classifiers with respect to varying privacy
budget $\alpha$. The original ID3 classifier provides a
baseline accuracy at 76.9%. The DPCube ID3 achieves
slightly worse accuracy than, but comparable to, the
baseline due to the noise. While both the DPCube ID3
and the private interactive ID3 achieve better accuracy
with increasing privacy budget as expected, our
DPCube ID3 outperforms the private interactive ID3
due to its efficient use of the privacy budget.

Blocking for Record Linkage. We also evaluated the
utility of the released histogram for record linkage
and compared our method against the hierarchical
kd-tree scheme from [19]. The attributes we considered
for this experiment are: age, education, wage, marital
status, race and sex. As the histogram is used for the
blocking step and all pairs of records in matching
blocks will be further linked using an SMC protocol,
our main goal is to reduce the total number of pairs
of records in matching blocks in order to reduce the
SMC cost. We use the reduction ratio used in [19] as
our evaluation metric. It is defined as follows:

\[
\text{reduction ratio} = 1 - \frac{\sum_{i=1}^{k} n_i \cdot m_i}{n \cdot m}
\tag{10}
\]

where $n_i$ ($m_i$) corresponds to the number of records
in dataset 1 (resp. 2) that fall into the $i$th block,
$k$ is the total number of blocks, and $n$ ($m$) is the total
number of records in dataset 1 (resp. 2).
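A small Python sketch of the reduction ratio follows. It assumes the usual form of the metric, one minus the number of within-block record pairs divided by the number of all cross-dataset pairs, and that every record falls into exactly one block so that the dataset sizes are the sums of the block sizes. The function name is hypothetical.

```python
def reduction_ratio(block_sizes_1, block_sizes_2):
    """1 - sum_i(n_i * m_i) / (n * m) for aligned lists of block sizes.

    block_sizes_1[i] and block_sizes_2[i] are the numbers of records of
    dataset 1 and dataset 2 that fall into the i-th block."""
    if len(block_sizes_1) != len(block_sizes_2):
        raise ValueError("both datasets must use the same blocks")
    n = sum(block_sizes_1)
    m = sum(block_sizes_2)
    if n == 0 or m == 0:
        raise ValueError("both datasets must be non-empty")
    compared = sum(a * b for a, b in zip(block_sizes_1, block_sizes_2))
    return 1.0 - compared / (n * m)
```

With a single block no pairs are pruned and the ratio is 0; splitting two datasets of 100 records evenly into two blocks halves the number of pairs and gives a ratio of 0.5.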

We compared both methods by running experiments
with varying privacy budget ($\alpha$) values (using
the first 2 attributes of each record) and with varying
numbers of attributes (with $\alpha$ fixed to 0.1). Figure
18(a) shows the reduction ratio with varying privacy
budget. Both methods exhibit an increasing trend
in reduction ratio as the privacy budget grows, but
our 2-phase v-optimal histogram consistently
outperforms the hierarchical kd-tree approach and maintains
a steady reduction ratio around 85%. Figure 18(b)
shows the reduction ratio with varying number of
attributes (dimensions). As the number of attributes
increases, both methods show a drop in the reduction
ratio due to the sparsification of data points, which
increases the relative error for each cell/partition.
However, our DPCube approach exhibits desirable
robustness when the dimensionality increases compared
to the hierarchical kd-tree approach.

Fig. 18. Reduction ratio of blocking for record linkage



7 Conclusions and Future Work

We have presented a two-phase multidimensional 
partitioning algorithm with estimation algorithms for 
differentially private histogram release. We formally 
analyzed the utility of the released histograms and 
quantified the errors for answering linear counting 
queries. We showed that the released v-optimal
histogram combined with a simple query estimation
scheme achieves bounded query error and utility
superior to existing approaches for "smoothly"
distributed data. The experimental results on using
the released histogram for random linear counting 
queries and additional applications including classi- 
fication and blocking for record linkage showed the 
benefit of our approach. As future work, we plan to 
develop algorithms that are both data- and workload- 
aware to boost the accuracy for specific workloads 
and investigate the problem of releasing histograms 
for temporally changing data.